


MULTIPLE-CHOICE VERSUS CREATED-RESPONSE TEST ITEMS

LESLIE H. AULT^1

Teachers College, Columbia University^2

ABSTRACT

The issue of multiple-choice (MC) vs. created-response (CR) test-item
formats was reexamined at the eighth-grade level in three subject
areas: General Science, American History, and Arithmetic. In each
subject area, alternate forms with the same item-content but differing
in which items were in which format were prepared from standardized
tests. Between 269 and 289 students took each form. Measurement
equivalence was substantiated by correlations corrected for attenuation
between MC and CR items ranging from .90 to 1.04 (mean = .99).
Subgroups composed by sex, intelligence, and socio-economic status
(S.E.S.) showed no interactions with relative MC vs. CR discrimination,
but one interaction was found with relative difficulty. In Arithmetic,
CR items were relatively more difficult (than MC items) for lower
than for higher S.E.S. students. Comparisons of overall item-test
discrimination favored the CR items in Arithmetic and History, but
there was no difference in Science.

The widespread use of multiple-choice tests in America followed
the success of the Army "Alpha" test during World War One. These
"new-type" tests were attacked at the time and are still attacked
now on grounds that they encourage superficial learning and dilute
the educational process. Nevertheless, the acceptability of
multiple-choice tests was established by numerous empirical studies
investigating their psychometric properties during the 1920's and
early 1930's. Ruch (1929) is a good source for descriptions of many
of these early studies. A typical early study consisted of
administering a set of items in open-ended or created-response format,
and then on a later day administering the same items to the same
examinees but in multiple-choice or true-false format, with the result
that the "new-type" tests were found to have reliabilities about as
high as the created-response test and to correlate highly with it. As
Lindquist (1969) has pointed out, these technical justifications
combined with mechanical scoring capability to establish the
multiple-choice item as the dominant type, a development that ignored
the probability that "every type of test exercise is superior to every
other type for some specific purpose or purposes (p. 555)."



^1 The author is indebted to Dr. Elizabeth Hagen, under whose
chairmanship the dissertation on which this article is based was
developed.

^2 Now at Hostos Community College of the City University of New York.



Since the 1930's, empirical studies of test-item types or formats
have been relatively infrequent, and many pertinent studies had some
other issue as their main purpose. A notable exception is a
dissertation study by Cook (1955), who reported correlations corrected
for attenuation of .95 to 1.00 between multiple-choice and open-ended
versions of contemporary affairs items given to college freshmen.
However, the results of some studies have been less clear-cut,
including reports that American college students did relatively better
on multiple-choice tests than did British students, who in turn did
relatively better on essay tests (Vernon, 1962); of several low
correlations (as low as .22) between arithmetic items from standard
multiple-choice tests and open-ended counterparts given to
fourth-graders (Williamson & Hopkins, 1967); and of higher reliability
for an open-ended geometry test than for any of three multiple-choice
versions (Owens, Hanna, and Coppedge, 1970). These reports provided
indications that a further study might be worthwhile. In addition, a
systematic study could employ methodological improvements (such as
factor analysis and one test to a subject) over the old studies. The
present study was intended as a reexamination of the measurement
properties of multiple-choice (MC) and created-response (CR) test-item
formats.



The tests were at the junior-high level, where most of the items
are suitable for translation into CR format and where there is a mix
between straight factual items and ones requiring more sophistication
to answer. Tests at higher grade levels have many items unsuitable
for translation into CR format, while tests at lower grade levels seem
to have a preponderance of straight factual knowledge. Examples of
items in each subject are given below.

Science

MC paired (Form R, #3): Which of the following diseases is carried
by mosquitos? / A Cancer / *B Malaria / C Heart disease /
D Tuberculosis / E Pneumonia (p = .71, bis = .6[?])

CR paired (Form S, #[?]): What disease is commonly carried by
mosquitos? (p = .71, bis = .65)

MC untranslated (Form R, #7; Form S, #7): Evaporation of water will
take place fastest on a day which is / *A hot and dry. / B hot and
moist. / C cold and dry. / D cold and moist. / E variable in moisture
and temperature. (p = .75, .79; bis = .39, .48)

History

MC paired (Form W, #[?]): Patriots in the Revolutionary War received
important financial and military aid from the / A Indians. /
*B French. / C Loyalists. / D Russians. (p = .5[?], bis = .57)

CR paired (Form V, #[?]): From whom did the patriots in the
Revolutionary War receive important financial and military aid?
(p = .50, bis = .80)

MC untranslated (Form V, #1; Form W, #1): The development of
communication was furthered by the inventions of all of the following
men except / A Guglielmo Marconi / B Alexander G. Bell /
C S. F. B. Morse / *D Elias Howe. (p = .36, .36; bis = .55, .46)

Arithmetic

MC paired (Form Y, #[?]): Jim cuts a 15.6-inch length of copper
pipe into 6 equal lengths. How many inches long is each piece? /
A .026 / B .26 / C 2 / *D 2.6 / E 15 (p = .74, bis = .53)

CR paired (Form X, #[?]): Jim cuts a 15.6-inch length of copper pipe
into 6 equal lengths. How many inches long is each piece? (p = .49,
bis = .74)

MC untranslated (Form X, #38; Form Y, #38): Which of the following
products must be an odd number? / A 99,918 x 99,917 / B 99,918 x
99,921 / C 99,926 x 99,921 / D 99,926 x 99,926 / *E 99,929 x 99,933
(p = .46, .46; bis = .44, .55)



METHOD 

Specially-made tests with MC and CR items were prepared in General
Science, American History, and Arithmetic. The items were taken from
the Educational Testing Service's Cooperative Tests, with many of the
original MC items translated into CR format. In each subject area,
two alternate forms were assembled with the same item-content but
differing in which items were in which format. Thus each test form
contained (a) some items in CR format appearing in MC format in the
alternate form, (b) some items in MC format appearing in CR format in
the alternate form, and (c) some items appearing in MC format in
both forms. The items in the last group could not be translated into
equivalent CR items, but were used as "anchor" items. On the
presumption that CR items would take longer to answer than MC items,
a few items (least desirable statistically) were dropped from the
original test forms in order to maintain the same time limit for
administrative purposes. Further details are given in Table 1.



TABLE 1

Item Categories, by Form

                           Science      History     Arithmetic
                  form:    R     S      V     W      X     Y

Item Categories
(a) CR with MC pairs      16    16     18    18     16    16
(b) MC with CR pairs      16    16     18    18     16    16
(c) MC untranslated       18    18     24    24     13    13
Total items               50    50     60    60     45    45



The examinees were the entire eighth grade in a suburban New
York school. The tests were administered on separate days for each
subject area under the direction of the regular teachers. The tests
were distributed with the alternate forms in alternating order during
the normal class period with a 40-minute time limit. Most students
took one test in each of three subject areas, but some took only two
tests, some only one, and a few none, depending on their attendance
pattern.

In addition, sex, age, intelligence, and socio-economic status
(S.E.S.) were obtained for the students. Sex and age were supplied
by the students on the cover of the test booklets. Intelligence test
scores were obtained from the school records in the form of stanines
on the Lorge-Thorndike, or from other test results in a few cases.
S.E.S. was based on father's occupation (with reference to father's
education and mother's occupation and education where helpful) as
supplied by the students on the test booklet and as listed in the
school records. A three-level categorization was made using Blau
and Duncan's (1967) table broken into thirds.

A summary of the numbers and characteristics of the samples by
test form is shown in Table 2. Some of the differences were
noticeably large, but none were statistically significant at the .01
level, permitting comparisons to be made across equivalent samples.



TABLE 2

Numbers and Characteristics of Examinees, by Form

                                  Science          History         Arithmetic
                        form:    R       S        V       W        X       Y

Total Number of Examinees       289     274      284     276      276     269

Examinee Characteristics
  % Male                        49.1    59.5     53.9    54.0     54.0    53.5
  % Age 13                      83.0    78.[?]   77.5    83.0     81.1    78.8
  Intelligence: mean stanine    5.66    5.47     5.63    5.38     5.66    5.43
        standard deviation      1.87    1.7[?]   1.81    1.79     1.74    1.87
  S.E.S.: 3. upper (%)          24.9    22.6     23.2    24.6     24.2    25.3
          2. middle (%)         37.0    37.6     39.1    35.5     37.0    36.8
          1. lower (%)          37.7    39.8     37.3    39.9     38.4    37.9

Note: Age, intelligence, or S.E.S. was unknown for no more than two
students per test form.



The students answered directly in the test booklets by circling
the letter corresponding to their choice or by writing in a word,
phrase, or number. The correct answers to the CR items were typically
short and fairly concrete, making their scoring highly objective. The
scoring was checked by comparison of codes given by two independent
scorers for a sample of 20 tests for each form. After correction of a
few scoring inconsistencies thus uncovered, the remaining "scoring
error" on CR items amounted to 12 errors out of 2000 entries, which is
tolerably low. The scoring error on the MC items was 5 errors out of
4200 entries, either transcription mistakes or hard-to-judge circles.
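
The entry counts and error rates in this scoring check can be reproduced from the item counts in Table 1. The following is a minimal sketch for illustration only: the names are hypothetical, and it assumes that every item on each of the 20 double-scored tests per form counted as one entry, which gives 2000 CR entries and 4200 MC entries.

```python
# Item counts per form (R, S, V, W, X, Y), as given in Table 1.
TOTAL_ITEMS = [50, 50, 60, 60, 45, 45]
CR_ITEMS = [16, 16, 18, 18, 16, 16]
# Every item that is not CR is MC (paired or untranslated).
MC_ITEMS = [total - cr for total, cr in zip(TOTAL_ITEMS, CR_ITEMS)]
TESTS_PER_FORM = 20  # size of the double-scored sample for each form


def error_rate(errors: int, entries: int) -> float:
    """Return scoring errors as a proportion of scored entries."""
    return errors / entries


# Assumption: one scored entry per item per sampled test.
cr_entries = TESTS_PER_FORM * sum(CR_ITEMS)
mc_entries = TESTS_PER_FORM * sum(MC_ITEMS)

cr_rate = error_rate(12, cr_entries)
mc_rate = error_rate(5, mc_entries)
print(f"CR: {cr_rate:.2%} of {cr_entries} entries")
print(f"MC: {mc_rate:.2%} of {mc_entries} entries")
```

Under these assumptions the residual error is 0.60% for CR entries and about 0.12% for MC entries.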
RESULTS

The analyses of the data were aimed at four main questions: (1)
whether MC and CR items provide equivalent measurement, (2) how MC
and CR items compare in item-test discrimination, and whether there
are any differences among subgroups divided by sex, intelligence, and
S.E.S. between MC and CR items in (3) difficulty and (4) item-test
discrimination.

Measurement Equivalence of MC and CR Items

The simple and direct way to investigate measurement equivalence
is to correlate scores on the MC and CR items. This was done within
each test form, with the MC items divided into the "paired" and
"untranslated" categories. As shown in Table 3, the six correlations
between the CR and MC-paired subsets ranged between .66 and .80 raw,
but between .90 and 1.04 (with a mean of .99) after correction for
attenuation. In addition, more often than not the CR items correlated
more highly with the MC-untranslated items than did the MC-paired
items, thus providing no indication of differences between the
formats. On the basis of the correlations, the MC and CR formats did
provide equivalent measurement in this study.
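
The correction for attenuation is not written out in the text. The sketch below assumes the standard Spearman formula, in which the raw correlation is divided by the square root of the product of the two reliabilities, and assumes that the KR20 values in Table 3 served as the reliability estimates. The function name is hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction: the observed correlation divided by the
    geometric mean of the two reliabilities."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Form W values as read from Table 3:
# raw r = .71, KR20 = .69 (CR-paired) and .70 (MC-paired).
corrected_w = correct_for_attenuation(0.71, 0.69, 0.70)
print(round(corrected_w, 2))  # 1.02
```

Corrected values above 1.00, as on the History and Arithmetic forms, can arise because the reliabilities are themselves estimates. Recomputing other columns from the rounded table entries may differ from the printed figures in the second decimal.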

TABLE 3

Item Subset Difficulty, Reliability, Intercorrelations, by Form

                                    Science         History        Arithmetic
                          form:    R      S        V      W        X      Y

CR-paired:
  Mean difficulty                 .39    [?]      [?]    .19      .39    .39
  KR20 reliability                .72    .71      .76    .69      .81    .78

MC-paired:
  Mean difficulty                 .57    .60      .51    .50      .56    .51
  KR20 reliability                .73    .74      .66    .70      .74    .75

MC-untranslated:
  Mean difficulty                 .53    [?]      .39    .38      .46    .42
  KR20 reliability                .74    .77      .70    .70      .75    .80

Intercorrelations:
  CR-paired, MC-paired
    raw                           .66    [?]      .70    .71      .80    .80
    corrected for attenuation     .90    .95      .98   1.02     1.02   1.04
  CR-paired, MC-untranslated
    raw                           .72    .71      [?]    .64      .74    [?]
    corrected for attenuation     .98    .92      .92    .89      .97    .95
  MC-paired, MC-untranslated
    raw                           .68    .70      .65    .62      .73    .73
    corrected for attenuation     .92    .95      .88    .92      .90    .90

The issue of measurement equivalence was also examined by factor
analysis. For each test form, a principal components analysis with
varimax rotation was performed on the matrix of tetrachoric
correlations among items, in an attempt to identify possible
format-related factors. For five of the six test forms, the second and
third factors accounted for only 3-4% of the variance (the first
factor is typically a strong factor associated with whatever the test
is measuring) and showed no relationships with item format. On one of
the History tests (Form W), the second factor accounted for 10.3% of
the variance and showed a marked relation with item format in both the
unrotated and rotated structures, with the MC-paired items highly
positive, the MC-untranslated items positive, and the CR items mostly
negative in their loadings. The result on Form W is interesting but
unconvincing as a valid format factor in view of the results on the
other five test forms. There is reason to associate the factor with
very difficult CR items, which occurred in greatest numbers on Form W
and contributed most of the negative loadings.
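
The percentage of variance attributed to a component in an analysis of this kind is its eigenvalue divided by the number of items, since the trace of a correlation matrix equals the number of items. The sketch below illustrates only that step. It uses a plain Jacobi eigenvalue iteration so that no third-party library is needed, the correlation matrix is hypothetical rather than data from the study, and the tetrachoric estimation and varimax rotation are omitted.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations,
    returned largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle (kept within +/- pi/4) that zeroes a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def variance_shares(corr):
    """Proportion of total variance carried by each principal component."""
    n = len(corr)
    return [value / n for value in jacobi_eigenvalues(corr)]


# Hypothetical 4-item correlation matrix (not data from the study).
corr = [
    [1.0, 0.5, 0.4, 0.3],
    [0.5, 1.0, 0.4, 0.3],
    [0.4, 0.4, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
shares = variance_shares(corr)
print([round(share, 3) for share in shares])
```

With real item data, a dominant first share followed by small later shares would correspond to the pattern reported for five of the six forms.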

Relative Discrimination of MC and CR Items

The relative discrimination of items in MC and CR formats was
examined by comparing the item-test biserial and point-biserial
correlations for each "item-pair" in its MC format and in its CR
format. The summary of these comparisons for each subject area is
shown in Table 4. In Science, there was very little difference
between the MC and CR formats in discrimination, using either the
point-biserial or biserial correlations. In Arithmetic, on the other
hand, both measures favored the CR format in discrimination. In
History, the comparison using point-biserials showed no difference,
but the use of the biserial correlation showed a substantial
difference in favor of the CR format. This discrepancy resulted from
the fact that many of the CR items in History proved to be very
difficult; the point-biserial, unlike the biserial, is markedly
affected by the proportion correct. These comparisons can also be
judged roughly from the reliabilities shown in Table 3.
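
The dependence of the point-biserial on the proportion correct can be illustrated with the standard conversion between the two coefficients: the biserial equals the point-biserial times sqrt(pq) / y, where p is the proportion correct, q = 1 - p, and y is the ordinate of the standard normal curve at the point dividing p from q. The sketch below is an illustration under that textbook formula, not a reconstruction of the study's item analysis; the function name and the .40 input are hypothetical.

```python
from statistics import NormalDist


def biserial_from_point_biserial(r_pb: float, p: float) -> float:
    """Convert a point-biserial to a biserial correlation for an item
    answered correctly by proportion p of examinees."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    std_normal = NormalDist()
    # Height of the normal curve at the cut dividing p from 1 - p.
    ordinate = std_normal.pdf(std_normal.inv_cdf(p))
    return r_pb * (p * (1 - p)) ** 0.5 / ordinate


# The same hypothetical point-biserial of .40 at two difficulty levels.
for p in (0.50, 0.19):
    print(p, round(biserial_from_point_biserial(0.40, p), 2))
```

For a fixed point-biserial, the implied biserial is larger for a hard item (p = .19) than for an item of middle difficulty (p = .50), which is the direction of the History discrepancy described above.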

TABLE 4

MC vs. CR Item Discrimination Summary

(Entries are based on MC minus CR discrimination for each item-pair,
using point-biserial and biserial item-test correlations)

                     Number of items with difference of:
                   +.12 or    .00 to    -.00 to   -.12 or      mean
                    more       +.11      -.11      more     difference     t

Science:
  point-biserial      6         11        10         5        .008        .39
  biserial            7          8         8         9       -.017        .69

History:
  point-biserial      6         11        10         9       -.002        .09
  biserial            4          6         5        21       -.0[?]6     3.86**

Arithmetic:
  point-biserial      3          9        12         8       -.043       2.11**
  biserial            5          6         7        14       -.074       2.10*

 * p < .05
** p < .01

Subgroup Differences in MC vs. CR Difficulty and Discrimination

Despite overall uniformities, there is the possibility that
different groups may perform differently as a function of item format.
This was investigated for subgroups composed by sex, intelligence,
and S.E.S. For this purpose, the students were divided into two
roughly equal groups on intelligence by stanines 6 and above, and 5
and below. Item analyses were performed for each subgroup separately,
and comparisons between them made using the difference in difficulty
and in point-biserial correlation for the MC and CR formats of each
item pair. There were no significant differences between subgroups
in relative (MC vs. CR) discrimination, but there was one significant
interaction in relative difficulty. On Arithmetic, the CR items
were relatively more difficult (than the MC items) for lower S.E.S.
students than for higher S.E.S. students. A logical explanation is
that this effect was related to the overall difference in
discrimination in favor of Arithmetic CR items. However, S.E.S.
correlated only in the .[?]0's with the total score and also with
intelligence, which in turn correlated .72 with the total score but
showed a weaker and non-significant differential difficulty. Possible
explanations are greater computational accuracy or greater tendency
to check one's answer among higher S.E.S. children.

DISCUSSION 

The present study supports the commonly-held notion that MC and
CR items provide equivalent measurement. Where discrimination among
examinees is the main purpose in testing, as where grades are to be
assigned or for correlational studies, the evidence suggests that
MC items can be used in place of CR items without disrupting what the
test is supposed to measure. Such is not the case where an absolute
rather than a relative standard is sought, as with a criterion test or
where the concept of "process levels" is considered important.

The suggestion that CR items may provide better discrimination
than their MC counterparts, at least in Arithmetic and American
History, is important for measurement theory. The effect, of course,
would be to improve test reliability by using CR items instead of
MC items, which would be desirable if other things were equal. However,
there are also considerations of mass scoring and administrative time.
Obviously scoring time becomes more important as a consideration and
favors MC items as the numbers of examinees increase. Present
mechanical scoring capabilities for certain types of CR answers, such
as described by Lindquist (1969), are promising but unavailable for
routine use. It would be useful for some future research in MC vs. CR
comparisons to employ such machines and thus exert pressure for their
continued development. Administrative time assumes importance in that
CR items apparently require more time than do MC items. This extra
time could also be used to add items to an MC test, thereby increasing
reliability to perhaps the same level as provided by a CR test within
the same testing time. In the present study, estimates indicate
approximate equality in reliability for MC and CR items based on equal
administrative time, but it is unknown whether the time-per-item could
have been reduced somewhat without unduly affecting overall
reliability. In the Owens, Hanna, and Coppedge (1970) study,
administrative time was equal and the CR version was superior in
reliability to any of the three MC versions. Further research in
relative MC vs. CR discrimination should pay close attention to
optimal administrative times, as well as examine the effects for other
subject areas, age levels, testing settings, and types of tests.
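
The trade-off between adding MC items and using CR items can be illustrated with the Spearman-Brown prophecy formula. The text does not say how its equal-time reliability estimates were obtained, so the sketch below is an assumption-laden illustration rather than the study's procedure: it assumes Spearman-Brown applies (added items comparable to the existing ones), the function names are hypothetical, and the inputs are the Form X KR20 values from Table 3.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor
    using comparable items (Spearman-Brown prophecy formula)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(current: float, target: float) -> float:
    """Factor by which a test must be lengthened to reach the target
    reliability, obtained by solving Spearman-Brown for the factor."""
    return target * (1 - current) / (current * (1 - target))


# Form X, Table 3: MC-paired KR20 = .74, CR-paired KR20 = .81.
factor = length_factor_needed(0.74, 0.81)
print(round(factor, 2))                        # 1.5
print(round(spearman_brown(0.74, factor), 2))  # 0.81
```

On these figures, the 16-item MC-paired subset would need to grow by about half, to roughly 24 items, to match the reliability of the 16 CR items. Whether that many extra MC items fit in the time saved is the open question the paragraph raises.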